Incorporating Nesterov Momentum into Adam
Abstract
When attempting to improve the performance of a deep learning system, there are roughly three approaches one can take. The first is to improve the structure of the model, perhaps by adding another layer, switching from simple recurrent units to LSTM cells [4], or, in the realm of NLP, taking advantage of syntactic parses (e.g. as in [13, et seq.]). Another approach is to improve the initialization of the model, guaranteeing that the early-stage gradients have certain beneficial properties [3], building in large amounts of sparsity [6], or taking advantage of principles of linear algebra [15]. The final approach is to try a more powerful learning algorithm, such as including a decaying sum over the previous gradients in the update [12], dividing each parameter update by the L2 norm of the previous updates for that parameter [2], or even forgoing first-order algorithms for more powerful but more computationally costly second-order algorithms [9]. This paper pursues the third option: improving the quality of the final solution by using a faster, more powerful learning algorithm.
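To make the third option concrete, the sketch below combines the two first-order ingredients mentioned above (a decaying sum over past gradients and division by a running norm of past gradients) into a simplified Adam-style step with a Nesterov-flavoured momentum term, which is the general idea the title refers to. The function name, hyperparameter values, and the use of a constant momentum coefficient instead of a momentum schedule are this editor's illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def nadam_step(theta, g, m, v, t, alpha=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Nesterov-style Adam update (constant beta1 for clarity).
    Returns the new parameters and the updated moment estimates."""
    # Decaying sum (exponential moving average) of past gradients: momentum.
    m = beta1 * m + (1 - beta1) * g
    # Decaying sum of past squared gradients: the RMS denominator.
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias-corrected estimates (t counts steps starting from 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov "look-ahead": mix the corrected momentum with the current
    # gradient; using m_hat alone here would give a plain Adam step.
    m_bar = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
    theta = theta - alpha * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v
```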
Similar resources
Don't Decay the Learning Rate, Increase the Batch Size
It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but wi...
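As a rough illustration of the recipe described in this abstract, here is a minimal training-loop sketch in which the batch size is multiplied at the epochs where a learning-rate schedule would normally divide the step size. The milestone epochs, growth factor, and the grad_fn callback are hypothetical choices for illustration, not settings taken from the paper.

```python
import numpy as np

def train(X, y, grad_fn, epochs=90, lr=0.1, batch_size=128,
          milestones=(30, 60, 80), growth=5):
    """Keep the learning rate fixed and grow the batch size instead of
    decaying the learning rate at each milestone epoch (illustrative values).
    grad_fn(theta, X_batch, y_batch) must return the minibatch gradient."""
    theta = np.zeros(X.shape[1])
    for epoch in range(epochs):
        if epoch in milestones:
            batch_size *= growth          # in place of: lr /= growth
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            theta -= lr * grad_fn(theta, X[batch], y[batch])
    return theta
```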
Acceleration of Gradient-based Path Integral Method for Efficient Optimal and Inverse Optimal Control
This paper deals with a new accelerated path integral method, which iteratively searches for optimal controls within a small number of iterations. This study is based on the recent observation that a path integral method for reinforcement learning can be interpreted as gradient descent. This observation also applies to an iterative path integral method for optimal control, which sets a convincing ar...
Online Learning Rate Adaptation with Hypergradient Descent
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a range of optimization problems by applying it to stochastic gradient descent, stochastic gradient descent with Nesterov momentum, and Adam, showing that it significantly reduces the need for the man...
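A minimal sketch of this idea for the plain-SGD case follows. Because one SGD step is theta_t = theta_{t-1} - alpha * grad f(theta_{t-1}), the derivative of the loss with respect to alpha is the negative dot product of the current and previous gradients, so the learning rate can itself be adjusted by gradient descent. The function name, the hyper-learning-rate beta, and the fixed step count are assumptions made for illustration.

```python
import numpy as np

def sgd_hd(theta, grad_fn, steps=1000, alpha=0.001, beta=1e-4):
    """SGD whose learning rate alpha is adapted online: alpha grows while
    successive gradients point in similar directions and shrinks when they
    oppose each other.  theta is a 1-D parameter vector; grad_fn(theta)
    returns the gradient at theta."""
    g_prev = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        alpha = alpha + beta * np.dot(g, g_prev)  # hypergradient step on alpha
        theta = theta - alpha * g                 # ordinary SGD step
        g_prev = g
    return theta, alpha
```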
Three-cocycles, Nonassociative Gauge Transformations and Dirac's Monopole
In 1931 Dirac [1] introduced a magnetic monopole into quantum mechanics and found a quantization relation between an electric charge e and a magnetic charge q, 2μ = n, n ∈ Z, where μ = eq and ℏ = c = 1. One of the widely accepted proofs of the Dirac selection rule is based on group representation theory (see, for example, [2, 3, 4, 5, 6]). In the presence of the magnetic monopole the operato...
Improved Stochastic gradient descent algorithm for SVM
In order to improve the efficiency and classification ability of support vector machines (SVM) based on the stochastic gradient descent algorithm, three improved stochastic gradient descent (SGD) algorithms, namely Momentum, Nesterov accelerated gradient (NAG), and RMSprop, are used to solve the support vector machine. The experimental results show that the algorithm based on RMSprop for solving the l...
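Since the abstract is cut off, the following is only a guess at the general setup it describes: a linear SVM trained by a stochastic subgradient method whose per-coordinate step sizes come from RMSprop (one of the three variants named above). The function name, regularization constant, and other hyperparameters are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def svm_rmsprop(X, y, lam=1e-3, lr=0.01, rho=0.9, eps=1e-8, epochs=10):
    """Stochastic subgradient training of a linear SVM (hinge loss plus L2
    regularization) with RMSprop scaling, i.e. each coordinate's step is
    divided by the root of a running average of squared subgradients.
    Labels y must be +1/-1."""
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            margin = y[i] * X[i].dot(w)
            # Subgradient of lam/2*||w||^2 + max(0, 1 - y*w.x).
            g = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            v = rho * v + (1 - rho) * g ** 2
            w -= lr * g / (np.sqrt(v) + eps)
    return w
```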